1 Introduction

In this notebook we look at fitting generalized linear models with elastic net regularization. Elastic net models are a form of penalized regression. The elastic net penalty is a linear combination of the \(L_1\) (lasso) and \(L_2\) (ridge regression) penalties.

The elastic net penalty is of the form

\[ \alpha \ell_1 + (1 - \alpha) \ell_2 \] where \(\ell_1\) is the \(L_1\) penalty and \(\ell_2\) is the \(L_2\) penalty. Hence the lasso and ridge regression are special cases of the elastic net penalty. In particular, \(\alpha = 1\) corresponds to lasso, and \(\alpha = 0\) corresponds to ridge regression.
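As a concrete sketch of this penalty (illustrative only; `beta`, `alpha`, and `lambda` here are toy inputs, not objects used later in this notebook):

```r
# Elastic net penalty on a coefficient vector, matching the form above:
# lambda * (alpha * L1 + (1 - alpha) * L2).
# (Note: glmnet and H2O scale the L2 term by 1/2; omitted here to match the text.)
elastic_net_penalty <- function(beta, alpha, lambda) {
  l1 <- sum(abs(beta))   # lasso component
  l2 <- sum(beta^2)      # ridge component
  lambda * (alpha * l1 + (1 - alpha) * l2)
}

elastic_net_penalty(c(1, -2), alpha = 1, lambda = 0.5)  # pure lasso
elastic_net_penalty(c(1, -2), alpha = 0, lambda = 0.5)  # pure ridge
```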

Elastic net regularization offers advantages over standard generalized linear modeling. Estimates from optimally tuned elastic net models will often be more stable than those from an ordinary generalized linear model, with lower overall error. In terms of the bias-variance tradeoff, the benefit of reduced variance can outweigh the bias introduced into the estimates.

Elastic net models can also handle correlated predictors, and categorical predictors with large numbers of levels, more easily than an ordinary generalized linear model, which may require more manual intervention and review in such cases.

At the same time, elastic net models offer the same advantages with respect to interpretability and ease of implementation as the ordinary GLM. Elastic net models may thus provide a valuable tool for the insurance modeler.

2 Setup

In this document, we will use a sample of French personal auto policies from an unknown insurer, available as the data set freMPL1 in the CASdatasets package.

library(CASdatasets)
library(dplyr)
library(ggplot2)
library(h2o)

set.seed(256)

source("code/elastic_net_models/model_functions.R")

data(freMPL1)

Here is a look at a few rows of the data.

as_tibble(freMPL1)
## # A tibble: 30,595 x 22
##    Exposure LicAge RecordBeg  RecordEnd  VehAge Gender MariStat SocioCateg
##       <dbl>  <int> <date>     <date>     <fct>  <fct>  <fct>    <fct>     
##  1    0.583    366 2004-06-01 NA         2      Female Other    CSP1      
##  2    0.2      187 2004-10-19 NA         0      Male   Alone    CSP55     
##  3    0.083    169 2004-07-16 2004-08-16 1      Female Other    CSP1      
##  4    0.375    170 2004-08-16 NA         1      Female Other    CSP1      
##  5    0.5      224 2004-01-01 2004-07-01 3      Male   Other    CSP47     
##  6    0.499    230 2004-07-01 NA         3      Male   Other    CSP47     
##  7    0.218    169 2004-01-01 2004-03-20 6-7    Male   Other    CSP50     
##  8    0.75     232 2004-01-01 2004-10-01 4      Female Other    CSP55     
##  9    0.249    241 2004-10-01 NA         5      Female Other    CSP55     
## 10    0.75     298 2004-01-01 2004-10-01 2      Male   Other    CSP50     
## # … with 30,585 more rows, and 14 more variables: VehUsage <fct>,
## #   DrivAge <int>, HasKmLimit <int>, BonusMalus <int>, VehBody <fct>,
## #   VehPrice <fct>, VehEngine <fct>, VehEnergy <fct>, VehMaxSpeed <fct>,
## #   VehClass <fct>, ClaimAmount <dbl>, RiskVar <int>, Garage <fct>,
## #   ClaimInd <int>

There are 30,595 observations in the data.

We will use the majority of variables as predictors. Our target variable will be pure premium, taken as the ratio of loss to exposure.

We will fit pure premium models as elastic net GLMs with a Tweedie distribution and log link function. We will use H2O to fit the models; it provides a very efficient algorithm for fitting regularized GLMs and supports the Tweedie distribution in doing so.

We will perform a grid search in fitting the models. The grid will be over the \(\alpha\) and \(\lambda\) values. Recall \(\alpha\) controls the amount of \(L_1\) vs. \(L_2\) penalty, while \(\lambda\) controls the overall amount of regularization. Higher values of \(\lambda\) mean more regularization.

At each combination of alpha/lambda on our grid, we will fit an elastic net GLM with 10-fold cross-validation. We will compute a few metrics using the cross-validated predictions (area under a Lorenz curve, Cramer-von Mises distance between the distributions of actual and predicted loss, and the weighted variance of predicted values).
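The helper functions that compute these metrics live in `model_functions.R` and are not shown here. As a simplified sketch, an exposure-weighted Lorenz AUC of the kind described might look as follows (the actual helper may differ in details such as tie handling):

```r
# Order records by predicted pure premium (highest first), then accumulate the
# share of exposure (x) against the share of actual loss (y); return the
# trapezoidal area under that curve. A higher area means the model better
# concentrates loss in the highest-predicted records.
lorenz_auc_simple <- function(actual_loss, predicted, exposure) {
  ord <- order(predicted, decreasing = TRUE)
  x <- cumsum(exposure[ord]) / sum(exposure)
  y <- cumsum(actual_loss[ord]) / sum(actual_loss)
  sum(diff(c(0, x)) * (c(0, head(y, -1)) + y) / 2)
}

# A model that perfectly ranks the single losing record scores well above 0.5
lorenz_auc_simple(c(0, 0, 10), c(1, 2, 3), c(1, 1, 1))
```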

.params <- list(
  predictors = c(
    "VehAge", "VehAge_log", "Gender", 
    "MariStat", "SocioCateg", "VehUsage", "DrivAge", "DrivAge_log", 
    "DrivAge_sqrt", "DrivAge_cubert", "HasKmLimit", "BonusMalus", 
    "BonusMalus_log", "BonusMalus_sqrt", "VehBody", "VehEngine", 
    "VehEnergy", "VehMaxSpeed", "VehMaxSpeed_log", "VehClass", "RiskVar", 
    "RiskVar_log", "Garage"
  ),
  tgt_var = "target",
  wgt_var = "Exposure",
  n_folds = 10,
  lambda_min_ratio = 0.00001,
  nlambda = 100,
  alpha_seq = seq(0, 1, by = 0.1),
  fit_pars = list(
    family = "tweedie", 
    tweedie_variance_power = 1.5,
    tweedie_link_power = 0
  ),
  h2o_mem = "32G",
  h2o_cores = parallel::detectCores()
)

data <- freMPL1 %>%
  as_tibble() %>% 
  transmute(
    PolNumber = 1:n(),
    Exposure, 
    VehAge = readr::parse_number(as.character(VehAge)),
    VehAge_log = log(1 + VehAge),
    Gender,
    MariStat,
    SocioCateg = forcats::fct_lump_prop(SocioCateg, 0.01, w = Exposure),
    VehUsage,
    DrivAge,
    DrivAge_log = log(DrivAge),
    DrivAge_sqrt = sqrt(DrivAge),
    DrivAge_cubert = DrivAge ^ (1 / 3),
    HasKmLimit = factor(HasKmLimit, levels = c(0, 1), labels = c("No", "Yes")),
    BonusMalus,
    BonusMalus_log = log(BonusMalus),
    BonusMalus_sqrt = sqrt(BonusMalus),
    VehBody,
    VehEngine,
    VehEnergy,
    VehMaxSpeed = pmax(100, readr::parse_number(as.character(VehMaxSpeed))),
    VehMaxSpeed_log = log(VehMaxSpeed),
    VehClass, 
    RiskVar,
    RiskVar_log = log(RiskVar),
    Garage, 
    ClaimAmount,
    target = ClaimAmount / Exposure
  )

checkmate::assert_subset(
  .params$predictors,
  colnames(data)
)

data_model <- data %>%
  select(c(.params$predictors, .params$tgt_var, .params$wgt_var)) %>%
  mutate_if(is.character, factor) %>%
  mutate_at(.params$tgt_var, ~pmax(0, .)) %>%
  mutate(
    .fold = sample.int(.params$n_folds, nrow(.), replace = TRUE),
    .loss = !!sym(.params$tgt_var) * !!sym(.params$wgt_var)
  )
h2o.init(max_mem_size = .params$h2o_mem, nthreads = .params$h2o_cores)
data_model_h2o <- h2o_import_fast(data_model)

3 Find lambda.max

Our first step in constructing a grid of alpha/lambda values is to determine the value of \(\lambda_\max\), which is the smallest value of \(\lambda\) that drives all coefficients to zero in a lasso model (\(\alpha = 1\)).

For details on the algorithms for fitting elastic net models, see the article Regularization Paths for Generalized Linear Models via Coordinate Descent by Friedman, Hastie, and Tibshirani in the Journal of Statistical Software.

In Section 2.5 of the paper, note that in the case of an ordinary linear model, \(\lambda_\max\) is given by

\[ \lambda_\max= \frac{\max_\ell \left| \left\langle x_\ell, y \right\rangle \right|}{N\alpha} \]

Here \(\lambda_\max\) is for a specific value of \(\alpha\). For simplicity, we will refer to \(\lambda_\max\) as the value such that coefficients go to zero for \(\alpha = 1\) (i.e., lasso).

If we call \(\lambda_\max\) the value for \(\alpha = 1\) and \(\lambda_\max^\alpha\) the value for general \(\alpha\), we see that

\[ \lambda_\max^\alpha = \frac{\lambda_\max}{\alpha}. \]

In the case of a generalized linear model, \(\lambda_\max\) can be expressed in terms of deviance residuals, but this relationship still holds.

Thus to get \(\lambda_\max^\alpha\) for any value of \(\alpha\), we only need to get it for a single value of \(\alpha\). Here we will get the value \(\lambda_\max\) for \(\alpha = 1\).
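This scaling can be sketched in a couple of lines (the illustrative value 0.8 below is made up, not the \(\lambda_\max\) fitted in this notebook; the cap on the divisor mirrors the grid construction used later, since the formula degenerates at \(\alpha = 0\)):

```r
# lambda_max for a given alpha, from the lambda_max found at alpha = 1.
# Pure ridge (alpha = 0) never zeroes all coefficients, so cap the divisor.
lambda_max_for_alpha <- function(lambda_max, alpha) {
  lambda_max / pmax(alpha, 0.01)
}

lambda_max_for_alpha(0.8, 0.5)  # twice the lasso lambda_max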

lambda_max_glm_args <- rutils::list_edit(
  list(
    x = .params$predictors,
    y = .params$tgt_var,
    training_frame = data_model_h2o,
    weights_column = .params$wgt_var,
    alpha = 1,
    nlambdas = 1,
    lambda_search = TRUE
  ),
  !!!.params$fit_pars
) 

lambda_max_model <- purrr::lift(h2o::h2o.glm)(lambda_max_glm_args)

lambda_max <- lambda_max_model@parameters$lambda

coef_universe <- lambda_max_model@model$coefficients_table %>%
  as_tibble() %>%
  select(names) %>%
  filter(names != "Intercept") %>%
  tidyr::separate(names, c("var", "level"), sep = "\\.", remove = FALSE) %>%
  mutate(
    level = coalesce(level, var),
    var_grp = case_when(
      stringr::str_detect(var, "^VehAge") ~ "VehAge",
      stringr::str_detect(var, "^DrivAge") ~ "DrivAge",
      stringr::str_detect(var, "^BonusMalus") ~ "BonusMalus",
      stringr::str_detect(var, "^VehMaxSpeed") ~ "VehMaxSpeed",
      stringr::str_detect(var, "^RiskVar") ~ "RiskVar",
      TRUE ~ var
    )
  )

h2o::h2o.rm(lambda_max_model)

4 Model with no regularization

For subsequent comparison to the regularized models, we will fit a model without any regularization.

model_noreg_args <- rutils::list_edit(
  list(
    x = .params$predictors,
    y = .params$tgt_var,
    training_frame = data_model_h2o,
    weights_column = .params$wgt_var,
    lambda = 0,
    fold_column = ".fold",
    keep_cross_validation_predictions = TRUE
  ),
  !!!.params$fit_pars
) 

model_noreg <- purrr::lift_dl(h2o::h2o.glm)(model_noreg_args)

model_noreg_output <- tibble(
  alpha = NA_real_,
  log_lambda = -Inf,
  model = list(get_model_output(model_noreg))
) %>%
  tidyr::unnest_wider(model) %>%
  tidyr::unnest_wider(metrics)

h2o::h2o.rm(model_noreg)

5 Alpha/lambda grid

Now we construct our alpha/lambda grid. We build the grid as follows: we have specified a sequence of alpha values, along with the number of lambdas to use at each alpha.

For each alpha value, we generate the specified number of lambdas, evenly spaced on the log scale, ranging from the lambda min ratio times \(\lambda_\max\) up to \(\lambda_\max\) (where \(\lambda_\max\) is the value for that specific alpha).

grid_params <- tibble(
  alpha = .params$alpha_seq,
  lambda_max = lambda_max / pmax(alpha, 0.01),
  lambda_min = if_else(alpha == 0, 0.1, 1) * .params$lambda_min_ratio * lambda_max,
  log_lambda = purrr::map2(
    lambda_min, lambda_max, 
    ~seq(log(.x), log(.y), length.out = .params$nlambda)
  ),
  width = (log(lambda_max) - log(lambda_min)) / (.params$nlambda - 1)
) %>%
  tidyr::unnest(log_lambda)

6 Fit models on grid

Next we fit cross-validated models at each alpha/lambda combination on our grid. We will compute the area under Lorenz curve, Cramer-von Mises distance between actual and predicted loss distributions, and variance of predicted values on the cross-validated predictions.

grid_model_output <- grid_params %>%
  mutate(
    model = purrr::pmap(
      list(alpha, log_lambda), 
      compute_grid_search_model,
      data_h2o = data_model_h2o,
      predictors = .params$predictors,
      tgt_var = .params$tgt_var,
      wgt_var = .params$wgt_var,
      fit_args = .params$fit_pars
    )
  ) %>% 
  tidyr::unnest_wider(model) %>% 
  tidyr::unnest_wider(metrics)

7 Metric variance estimates

At each point on our grid, we have computed model metrics, so we can evaluate which combination performed best. However, we also want to know how significant the differences between metric values are. To get at this, we will estimate the number of standard deviations by which the value of a metric at a grid point falls below the best metric value.

In order to do this, we will first estimate the variance of the metric value at the optimal point. We will find the grid point with the optimal value of the area under the Lorenz curve or of the Cramer-von Mises distance, then fit 10 more sets of cross-validated predictions with different random fold assignments. We will compute the metrics on each of these sets of predictions, then calculate the variance of the metric values.

These variance estimates will then be used to compute, for each grid point, an estimated "drop": the number of standard errors by which each metric falls short of the best value for that metric.

One can then choose an optimal grid point as, say, the point with the lowest sum of Lorenz AUC drop and Cramer-von Mises distance drop (or according to some weighting where one metric is given more weight than the other, depending on what one wishes to optimize).
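As a toy illustration of this selection logic (all metric values and standard deviations below are made up, not results from this grid):

```r
# Three hypothetical grid points
lorenz_auc <- c(0.30, 0.29, 0.28)   # higher is better
prem_dist  <- c(0.05, 0.03, 0.04)   # lower is better
auc_sd  <- 0.005  # estimated CV standard deviation of the Lorenz AUC
dist_sd <- 0.004  # estimated CV standard deviation of the CvM distance

# Drop of each metric from its best value, in standard deviations
auc_drop  <- (max(lorenz_auc) - lorenz_auc) / auc_sd
dist_drop <- (prem_dist - min(prem_dist)) / dist_sd
which.min(auc_drop + dist_drop)  # grid point with the best combined drop
```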

fold_metric_var <- purrr::map(
  1:10, ~sample.int(10, nrow(data_model), replace = TRUE)
) %>%
  rlang::set_names(paste0(".fold", 1:10)) %>%
  as_tibble()

fold_metric_var_h2o <- h2o_import_fast(fold_metric_var)

data_model_h2o_metric_var <- h2o::h2o.cbind(data_model_h2o, fold_metric_var_h2o)

lorenz_top <- grid_model_output %>%
  arrange(desc(lorenz_auc)) %>%
  slice(1)

lorenz_auc_var <- purrr::map(
  paste0(".fold", 1:10),
  function(fold) {
    compute_grid_search_model(
      alpha = lorenz_top$alpha,
      log_lambda = lorenz_top$log_lambda,
      data_h2o = data_model_h2o_metric_var,
      predictors = .params$predictors,
      tgt_var = .params$tgt_var,
      wgt_var = .params$wgt_var,
      fit_args = .params$fit_pars,
      fold_col = fold
    )
  }
)

lorenz_auc_cv_sd <- lorenz_auc_var %>%
  purrr::map_dbl(purrr::pluck, "metrics", "lorenz_auc") %>%
  sd()

grid_model_output_auc_drop <- grid_model_output %>%
  mutate(lorenz_auc_drop_sd = (max(lorenz_auc) - lorenz_auc) / lorenz_auc_cv_sd) 

dist_top <- grid_model_output_auc_drop %>%
  filter(percent_rank(lorenz_auc_drop_sd) <= 0.25) %>%
  arrange(prem_dist) %>%
  slice(1)

prem_dist_var <- purrr::map(
  paste0(".fold", 1:10),
  function(fold) {
    compute_grid_search_model(
      alpha = dist_top$alpha,
      log_lambda = dist_top$log_lambda,
      data_h2o = data_model_h2o_metric_var,
      predictors = .params$predictors,
      tgt_var = .params$tgt_var,
      wgt_var = .params$wgt_var,
      fit_args = .params$fit_pars,
      fold_col = fold
    )
  }
)

prem_dist_cv_sd <- prem_dist_var %>%
  purrr::map_dbl(purrr::pluck, "metrics", "prem_dist") %>%
  sd()

grid_model_output <- grid_model_output_auc_drop %>%
  mutate(
    prem_dist_drop_sd = (prem_dist - min(prem_dist)) / prem_dist_cv_sd,
    drop_combined = lorenz_auc_drop_sd + prem_dist_drop_sd
  )

h2o::h2o.rm(list(fold_metric_var_h2o, data_model_h2o_metric_var))
grid_ll <- grid_model_output %>%
  summarise(
    range = diff(range(log_lambda)), 
    min = min(log_lambda)
  )

grid_metrics_plot <- bind_rows(
  grid_model_output,
  model_noreg_output %>%
    mutate(
      alpha = 0,
      log_lambda = grid_ll$min - 0.1 * grid_ll$range, 
      width = grid_ll$range * 0.05
    )
)

8 Area under Lorenz curve

Here we plot the value of the area under the Lorenz curve at each point of the alpha/lambda grid. We also plot the log of the difference between the maximum Lorenz area and that of the cell. This visualizes the same information, but may illustrate the results more clearly in some instances.

At the bottom left (corresponding to alpha of 0 and a low value of lambda), the result from the model with no regularization is plotted. It can be seen that regularization can improve performance on this metric.

ggplot(data = grid_metrics_plot) + 
  theme_minimal() +
  geom_tile(
    aes(
      x = log_lambda,
      y = alpha,
      fill = lorenz_auc,
      width = width
    )
  ) +
  viridis::scale_fill_viridis(direction = -1, option = "C") +
  scale_y_continuous(breaks = .params$alpha_seq) +
  theme(
    panel.grid.major.y = element_blank(), 
    legend.title = element_blank()
  ) + 
  ggtitle("Area under Lorenz curve", subtitle = "Higher is better") +
  xlab(expression(log(lambda))) + 
  ylab(expression(alpha))

ggplot(data = grid_metrics_plot) + 
  theme_minimal() +
  geom_tile(
    aes(
      x = log_lambda,
      y = alpha,
      fill = log(max(lorenz_auc) - lorenz_auc + lorenz_auc_cv_sd),
      width = width
    )
  ) +
  viridis::scale_fill_viridis(direction = 1, option = "C") +
  scale_y_continuous(breaks = .params$alpha_seq) +
  theme(
    panel.grid.major.y = element_blank(), 
    legend.title = element_blank()
  ) + 
  ggtitle(
    "Area under Lorenz curve", 
    subtitle = glue::glue(
      "Log of difference between maximum area and that of cell",
      "\n(more negative is better)"
    )
  ) +
  xlab(expression(log(lambda))) + 
  ylab(expression(alpha))

9 Difference between predicted and actual distributions

Now we plot the Cramer-von Mises distance between the actual and predicted loss distributions at each grid point. As with the Lorenz curve area plots, we also plot the log of the Cramer-von Mises statistic, which may make differences between cells easier to see.

As previously, at the bottom left (corresponding to alpha of 0 and a low value of lambda), the result from the model with no regularization is plotted. It can be seen that regularization can improve performance on this metric.
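For reference, a simplified unweighted version of such a distance between two samples might look like this (the notebook's `prem_dist` is computed by helpers in `model_functions.R` and may weight by exposure):

```r
# Mean squared distance between the two empirical CDFs, evaluated on the
# pooled sample, in the spirit of the two-sample Cramer-von Mises statistic.
cvm_distance <- function(a, b) {
  grid <- sort(c(a, b))
  mean((ecdf(a)(grid) - ecdf(b)(grid))^2)
}

cvm_distance(c(1, 2, 3), c(1, 2, 3))  # identical samples: 0
```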

ggplot(data = grid_metrics_plot) + 
  theme_minimal() +
  geom_tile(
    aes(
      x = log_lambda, 
      y = alpha, 
      fill = prem_dist,
      width = width
    )
  ) +
  viridis::scale_fill_viridis(direction = 1, option = "C") +
  scale_y_continuous(breaks = .params$alpha_seq) +
  theme(
    panel.grid.major.y = element_blank(), 
    legend.title = element_blank()
  ) + 
  ggtitle(
    "Difference between predicted loss distribution and\nactual loss distribution", 
    subtitle = "Cramer-von Mises statistic (lower is better)"
  ) +
  xlab(expression(log(lambda))) + 
  ylab(expression(alpha))

ggplot(data = grid_metrics_plot) + 
  theme_minimal() +
  geom_tile(
    aes(
      x = log_lambda, 
      y = alpha, 
      fill = log(prem_dist),
      width = width
    )
  ) +
  viridis::scale_fill_viridis(direction = 1, option = "C") +
  scale_y_continuous(breaks = .params$alpha_seq) +
  theme(
    panel.grid.major.y = element_blank(), 
    legend.title = element_blank()
  ) + 
  ggtitle(
    "Difference between predicted loss distribution and\nactual loss distribution", 
    subtitle = "Log of Cramer-von Mises statistic (lower is better)"
  ) +
  xlab(expression(log(lambda))) + 
  ylab(expression(alpha))

10 Amount of segmentation

Here we plot the weighted standard deviation of predicted values (weighted by exposure) at each alpha/lambda grid point. This can be seen as a sort of measure of the amount of segmentation generated by a model.
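The statistic plotted here is just the exposure-weighted standard deviation of the predictions; a minimal version is below (the notebook's helper may use an equivalent formula):

```r
# Exposure-weighted standard deviation: weighted mean, then the square root
# of the weighted second moment about that mean.
weighted_sd <- function(x, w) {
  mu <- sum(w * x) / sum(w)
  sqrt(sum(w * (x - mu)^2) / sum(w))
}

weighted_sd(c(1, 3), c(1, 1))  # equal weights: ordinary population SD
```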

ggplot(data = grid_metrics_plot) + 
  theme_minimal() +
  geom_tile(
    aes(
      x = log_lambda, 
      y = alpha,
      fill = sqrt(var), 
      width = width
    )
  ) +
  viridis::scale_fill_viridis(direction = -1, option = "C") +
  scale_y_continuous(breaks = .params$alpha_seq) +
  theme(
    panel.grid.major.y = element_blank(), 
    legend.title = element_blank()
  ) +
  ggtitle(
    "Weighted standard deviation of predictions", 
    subtitle = "Higher means more segmentation"
  ) +
  xlab(expression(log(lambda))) + 
  ylab(expression(alpha))

11 Table of metrics on grid

The following table provides the results of all the metrics calculated at each point on the alpha/lambda grid.

grid_model_output %>%
  arrange(drop_combined, lorenz_auc_drop_sd, prem_dist_drop_sd, desc(var)) %>%
  transmute(
    Alpha = alpha,
    Lambda = exp(log_lambda),
    `Log lambda` = log_lambda,
    `Lorenz AUC` = lorenz_auc,
    `AUC drop (std deviations)` = lorenz_auc_drop_sd,
    `Predicted/actual difference (Cramer-von Mises)` = prem_dist,
    `Predicted/actual drop (std deviations)` = prem_dist_drop_sd,
    `Variance of predictions` = var,
    `Combined AUC/distance drop` = drop_combined
  ) %>%
  DT::datatable(
    extensions = "FixedColumns",
    rownames = FALSE,
    options = list(
      fixedColumns = list(leftColumns = 2),
      scrollX = TRUE
    ),
    width = "auto",
    height = "auto"
  ) %>%
  DT::formatRound(
    c("Lambda", "Log lambda"), 
    digits = 4
  ) %>%
  DT::formatRound(
    c("Lorenz AUC", "Predicted/actual difference (Cramer-von Mises)"), 
    digits = 6
  ) %>%
  DT::formatRound(
    "Variance of predictions", 
    digits = 2
  ) %>%
  DT::formatRound(
    c("AUC drop (std deviations)", 
      "Predicted/actual drop (std deviations)", 
      "Combined AUC/distance drop"),
    digits = 3
  )

12 Regularization paths

In this section, we plot the full regularization paths for each variable in the model. Model terms have been logically grouped and plotted together (i.e., categorical levels are plotted together; all transformations/versions of a numeric variable are grouped and plotted together). The regularization path is plotted for each value of alpha. On each alpha facet, the vertical line corresponds to the value of lambda that was the best-performing model (defined here as the model with the smallest sum of drops in Lorenz AUC and Cramer-von Mises distance) for that value of alpha.

coef_path_by_var <- grid_model_output %>%
  select(alpha, log_lambda, coef) %>%
  tidyr::unnest(coef) %>%
  inner_join(coef_universe, by = "names") %>%
  tidyr::nest(coef_df = c(-var_grp))

optimal_lambda <- grid_model_output %>% 
  group_by(alpha) %>% 
  filter(row_number(drop_combined) == 1) %>%
  ungroup()

purrr::pwalk(
  as.list(coef_path_by_var),
  function(coef_df, var_grp, lambda_lines) {
    plot <- plot_regularization_path(coef_df, var_grp, lambda_lines)
    cat("##", var_grp, "\n\n")
    print(plot)
    cat("\n\n")
  },
  lambda_lines = optimal_lambda
)

12.1 SocioCateg

12.2 VehBody

12.3 VehEngine

12.4 VehClass

12.5 VehEnergy

12.6 VehUsage

12.7 Garage

12.8 HasKmLimit

12.9 Gender

12.10 MariStat

12.11 VehAge

12.12 DrivAge

12.13 BonusMalus

12.14 VehMaxSpeed

12.15 RiskVar